arXiv:1501.06412v1 [cs.IR] 26 Jan 2015


The Anatomy of Relevance 


Topical, Snippet and Perceived Relevance in Search Result Evaluation 


Aleksandr Chuklin* Maarten de Rijke 

University of Amsterdam, Amsterdam, The Netherlands

{a.chuklin, derijke}@uva.nl


ABSTRACT 

Currently, the quality of a search engine is often determined 
using so-called topical relevance, i.e., the match between 
the user intent (expressed as a query) and the content of 
the document. In this work we want to draw attention to 
two aspects of retrieval system performance affected by the 
presentation of results: result attractiveness (“perceived
relevance”) and immediate usefulness of the snippets (“snippet
relevance”). Perceived relevance may influence the discoverability
of good topical documents, and seemingly better rankings
may in fact be less useful to the user if good-looking snippets
lead to irrelevant documents or vice versa. Moreover, result items
on a search engine result page (SERP) with high snippet
relevance may add to the total utility gained by the
user even without the need to click those items.

We start by motivating the need to collect different aspects 
of relevance (topical, perceived and snippet relevances) and 
how these aspects can improve evaluation measures. We 
then discuss possible ways to collect these relevance aspects 
using crowdsourcing and the challenges arising from that. 

Categories and Subject Descriptors 

H.3.3 [Information Storage and Retrieval]: Information
Search and Retrieval 

1. INTRODUCTION

For decades the main evaluation paradigm for search engines 
was the Cranfield methodology [?]. In a typical setting of a
TREC conference, the documents are evaluated by human 
raters who assign relevance labels based on their judgement 
about the relevance of the document to the user’s topic of 
interest, expressed as a query. A graded relevance scale is 
typically used with topical relevance labels ranging from 0 
to 4 or from irrelevant to highly relevant. 
These relevance labels can be obtained either from trained
experts or using a crowdsourcing approach. Either way,
cases of disagreement have to be addressed. These are
usually treated as raters’ mistakes, but they may also arise from
different interpretations of the user intent or of the notion of
relevance. In a traditional evaluation approach a single
relevance label is chosen for each document-topic pair. These
labels are then aggregated into SERP-level quality measures
such as DCG or ERR [2]. By using additional inputs
from raters, we can (a) refine these quality measures and
(b) better understand the performance of retrieval systems.
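
As a reference point for the SERP-level measures mentioned above, the following Python sketch computes DCG and ERR from a list of graded topical labels. It assumes the common exponential gain 2^g - 1 and a 0 to 4 grade scale; the function names and the example labels are illustrative and are not taken from this paper.

```python
import math


def dcg(grades):
    """Discounted cumulative gain with exponential gain (2^g - 1)
    and a log2(rank + 1) discount."""
    return sum((2 ** g - 1) / math.log2(k + 1)
               for k, g in enumerate(grades, start=1))


def err(grades, max_grade=4):
    """Expected reciprocal rank: the user scans top-down and stops at
    rank k with a probability derived from the grade of document k."""
    p_reach = 1.0  # probability that the user reaches the current rank
    score = 0.0
    for k, g in enumerate(grades, start=1):
        p_satisfied = (2 ** g - 1) / 2 ** max_grade
        score += p_reach * p_satisfied / k
        p_reach *= 1.0 - p_satisfied
    return score


if __name__ == "__main__":
    grades = [4, 0, 2]  # hypothetical labels for a three-result SERP
    print(dcg(grades), err(grades))
```

Both functions consume only topical labels, which is exactly the limitation the additional rater inputs are meant to address.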

2. RELATED WORK 

The idea to separate perceived and topical relevance was 
suggested by [3] while designing the DBN click model. Unlike
earlier click models, it suggests that the likelihood of
a user clicking a document depends not on the topical
relevance of the document, but rather on its perceived relevance,
since the user can only judge based on the result snippet.
This idea was later picked up by [12], who showed that while
topical and perceived relevance are correlated, there is a
noticeable discrepancy between them. They performed a
simulated experiment by modeling the user click probability
and showed that taking it into account would lead to a
substantially different ordering of the systems participating in
a TREC Web Track.

The idea to separate out snippet relevance appeared after the
introduction of good abandonment [?], i.e., cases when users
abandon a search result page without clicking any results
and yet are satisfied. This may be due to the SERP
being rich with instant answers [4], e.g., a weather widget or a
dictionary box, or due to the fact that a query has a precise
informational need that can easily be answered in a result
snippet [5]. In fact, as was shown by [?], a large portion of
abandoned searches was due to pre-determined behavior:
users came to a search engine with a prior intention to find
an answer directly on a SERP. This is especially true when
considering mobile search, where the internet connection can
be slow or the user interface is less convenient to use. We
complement these works by arguing that a good and relevant
snippet does not necessarily lead to a complete good
abandonment, but rather represents an aspect of utility gained
by the user that is currently ignored.

3. APPLICATION TO EVALUATION 

As was suggested by [1], many evaluation metrics, including
DCG and ERR, may be viewed as based on a click model.
This was further refined by [?], where a recipe for converting
any click model into a metric was presented:

$$\mathrm{uMetric} = \sum_{k=1}^{N} P(C_k = 1) \cdot R_k, \tag{1}$$

where $R_k$ is the (topical) relevance of the $k$-th document
in the ranking, and $P(C_k = 1)$ is the probability that the
user will click on that document. Depending on the user
model, the click probability may depend on attractiveness
parameters. This is where we can use perceived relevance
labels $A_k$ (attractiveness). For example, for a metric based
on the DCM model we have¹

$$\mathrm{uDCM} = \sum_{k=1}^{N} \alpha(A_k) \prod_{i=1}^{k-1} \bigl(1 - \alpha(A_i)\, s_i\bigr) \cdot R_k, \tag{2}$$

where $\alpha(A)$ is a list of parameters, one for each possible
value of the perceived relevance label $A$; $s_i$ is another list of
parameters, one for each value of the document position $i$.
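
To make Eq. (2) concrete, here is a minimal Python sketch of the DCM-based metric. The function name, the three-level perceived relevance scale, and all parameter values are assumptions chosen for illustration; in practice the attractiveness and position parameters would have to be estimated, for example from click logs.

```python
def u_dcm(attr_labels, rel_labels, alpha, s):
    """Eq. (2): expected topical relevance collected through clicks.

    attr_labels -- perceived relevance label A_k for each rank
    rel_labels  -- numeric topical relevance R_k for each rank
    alpha       -- mapping from perceived relevance label to click probability
    s           -- per-position parameters s_i, one per rank
    """
    p_examine = 1.0  # product over i < k of (1 - alpha(A_i) * s_i)
    total = 0.0
    for k, (a_label, rel) in enumerate(zip(attr_labels, rel_labels)):
        p_click = alpha[a_label] * p_examine
        total += p_click * rel
        p_examine *= 1.0 - alpha[a_label] * s[k]
    return total


if __name__ == "__main__":
    # Hypothetical parameters, not estimated from any data.
    alpha = {0: 0.1, 1: 0.5, 2: 0.9}
    s = [0.5, 0.5, 0.5]
    print(u_dcm([2, 0, 1], [1.0, 0.0, 0.5], alpha, s))
```

Keeping the running product in `p_examine` avoids recomputing the inner product for every rank, so the metric is evaluated in a single pass over the ranking.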

Further, if we want to use snippet relevance labels $S_k$, we
introduce a metric of the utility gained from the SERP itself,
similar to (1):

$$\mathrm{uMetric}_s = \sum_{k=1}^{N} P(E_k = 1) \cdot S_k, \tag{3}$$

where $P(E_k = 1)$ is the probability that the user examines
the $k$-th document. Again, for DCM that would lead us to:

$$\mathrm{uDCM}_s = \sum_{k=1}^{N} \prod_{i=1}^{k-1} \bigl(1 - \alpha(A_i)\, s_i\bigr) \cdot S_k. \tag{4}$$
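
The snippet-level counterpart in Eq. (4) can be sketched in the same way. The only difference from the click-based metric is that each snippet contributes as soon as it is examined, without requiring a click. As before, the function name and all numeric values are illustrative assumptions.

```python
def u_dcm_snippet(attr_labels, snippet_labels, alpha, s):
    """Eq. (4): expected snippet relevance collected by examining results.

    attr_labels    -- perceived relevance label A_k for each rank
    snippet_labels -- numeric snippet relevance S_k for each rank
    alpha          -- mapping from perceived relevance label to click probability
    s              -- per-position parameters s_i, one per rank
    """
    p_examine = 1.0  # P(E_k = 1) under DCM
    total = 0.0
    for k, (a_label, snip) in enumerate(zip(attr_labels, snippet_labels)):
        total += p_examine * snip
        p_examine *= 1.0 - alpha[a_label] * s[k]
    return total


if __name__ == "__main__":
    # Hypothetical parameters, not estimated from any data.
    alpha = {0: 0.1, 1: 0.5, 2: 0.9}
    s = [0.5, 0.5, 0.5]
    print(u_dcm_snippet([2, 0, 1], [0.0, 1.0, 0.5], alpha, s))
```

Note that the perceived relevance labels still matter here: an attractive result high in the ranking lowers the probability that snippets below it are examined at all.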

To summarize, we showed that by collecting perceived, topical
and snippet relevance we can refine system quality measures
(eqs. (2) and (4)). To estimate the effect of this refinement
one can compute correlations with online click metrics, similar
to [6], or with side-by-side comparison judgements collected
using an independent set of raters.
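
One simple way to carry out such a comparison is to compute a correlation coefficient between per-system offline scores and an online click metric. The sketch below uses the Pearson coefficient as one possible choice; the text does not prescribe a particular coefficient, and the scores shown are made-up placeholders, not experimental results.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Placeholder values for four hypothetical system variants.
    offline_scores = [0.61, 0.74, 0.58, 0.80]  # e.g. a refined offline metric
    online_scores = [0.30, 0.41, 0.33, 0.45]   # e.g. an online click metric
    print(pearson(offline_scores, online_scores))
```

A refined metric would be considered an improvement if its correlation with the online signal is higher than that of the baseline metric computed on the same systems.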

4. GATHERING JUDGEMENTS 

Now that we have argued that perceived, topical and snippet
relevance are potentially valuable dimensions of assessing
system quality, how do we gather the required judgements?
First, we believe that the topical relevance definition
used by TREC raters is time-tested and hence can
be used without modification. Second, snippet relevance
can be treated as document topical relevance with the document
replaced by its snippet. We also need additional messaging
for the raters explaining to them why the “documents” are
so short, to avoid undervalued scores. In order to prevent
the raters from confusing this task with perceived relevance
judgement, we may hide the fact that they are judging clickable
snippets and just refer to them as short summaries.
Similar ratings were collected by [?], where three possible
answers were offered to the raters: the snippet “answers the
user question,” “answers the question only partially,” “does
not answer the question.” Third and finally, perceived relevance
is a new task that has to be formulated by explaining
to the rater the story of a web search and asking her if she
would click this link in order to find the relevant information
in the document. The snippet has to be shown without the
context of the other snippets and without its placement on
a SERP to avoid position and presentation biases.

¹ A similar but more involved equation can be obtained for a metric
based on the DBN model [3].

These, then, are the challenges of gathering relevance judgements
in the multi-aspect setting that we are proposing:

• How do we make sure the raters do not confuse different
tasks (topical, snippet, perceived relevance)?

• How do we treat special SERP items such as images,
instant answers or interactive tools?

• What influence does the query category have on the 
difficulty of the task? For example, snippet relevance 
does not make sense for navigational queries. 

5. CONCLUSION 

This paper advocates for the need to review the notion of
relevance in order to improve evaluation as well as understand
the anatomy of relevance. We believe that after performing
initial experiments and collecting feedback from the raters,
we can address the challenges outlined above and derive a
judgement procedure that will allow us to collect all three
aspects of relevance, refine system performance evaluation
and get deeper insights into the foundation of relevance.